Problem Statement:

The purpose of this case study is to classify a given silhouette as one of four types of vehicle, using a set of features extracted from the silhouette. The vehicle may be viewed from one of many different angles. Four "Corgi" model vehicles were used for the experiment: a double decker bus, a Chevrolet van, a Saab 9000 and an Opel Manta 400. This particular combination of vehicles was chosen with the expectation that the bus, the van and either one of the cars would be readily distinguishable, but that it would be more difficult to distinguish between the two cars.

The points distribution for this case is as follows:

  1. Data pre-processing - Understand the data and treat missing values and outliers (use box plots) (5 points)
  2. Understanding the attributes - Find relationships between the different attributes (independent variables) and choose carefully which attributes should be part of the analysis, and why (5 points)
  3. Use PCA from scikit-learn and an elbow plot to find the reduced number of dimensions that covers more than 95% of the variance (10 points)
  4. Use Support Vector Machines with grid search (try C values 0.01, 0.05, 0.5, 1 and kernel = linear, rbf), find the best hyperparameters and use cross validation to find the accuracy (10 points)

Attribute Information:

ATTRIBUTES

COMPACTNESS (average perim)**2/area

CIRCULARITY (average radius)**2/area

DISTANCE CIRCULARITY area/(av.distance from border)**2

RADIUS RATIO (max.rad-min.rad)/av.radius

PR.AXIS ASPECT RATIO (minor axis)/(major axis)

MAX.LENGTH ASPECT RATIO (length perp. max length)/(max length)

SCATTER RATIO (inertia about minor axis)/(inertia about major axis)

ELONGATEDNESS area/(shrink width)**2

PR.AXIS RECTANGULARITY area/(pr.axis length*pr.axis width)

MAX.LENGTH RECTANGULARITY area/(max.length*length perp. to this)

SCALED VARIANCE ALONG MAJOR AXIS (2nd order moment about minor axis)/area

SCALED VARIANCE ALONG MINOR AXIS (2nd order moment about major axis)/area

SCALED RADIUS OF GYRATION (mavar+mivar)/area

SKEWNESS ABOUT MAJOR AXIS (3rd order moment about major axis)/sigma_min**3

SKEWNESS ABOUT MINOR AXIS (3rd order moment about minor axis)/sigma_maj**3

KURTOSIS ABOUT MINOR AXIS (4th order moment about major axis)/sigma_min**4

KURTOSIS ABOUT MAJOR AXIS (4th order moment about minor axis)/sigma_maj**4

HOLLOWS RATIO (area of hollows)/(area of bounding polygon)

Where sigma_maj**2 is the variance along the major axis and sigma_min**2 is the variance along the minor axis, and

area of hollows = area of bounding polygon - area of object

The area of the bounding polygon is found as a side result of the computation to find the maximum length. Each individual length computation yields a pair of calipers to the object orientated at every 5 degrees. The object is propagated into an image containing the union of these calipers to obtain an image of the bounding polygon.

NUMBER OF CLASSES

4 OPEL, SAAB, BUS, VAN

Import libraries and read the dataset, using .dropna() to drop rows with missing values (NAs)

In [50]:
# Importing libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
In [51]:
# Importing Data file
df = pd.read_csv('vehicle.csv').dropna()
df.shape
Out[51]:
(813, 19)
In [52]:
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 813 entries, 0 to 845
Data columns (total 19 columns):
compactness                    813 non-null int64
circularity                    813 non-null float64
distance_circularity           813 non-null float64
radius_ratio                   813 non-null float64
pr.axis_aspect_ratio           813 non-null float64
max.length_aspect_ratio        813 non-null int64
scatter_ratio                  813 non-null float64
elongatedness                  813 non-null float64
pr.axis_rectangularity         813 non-null float64
max.length_rectangularity      813 non-null int64
scaled_variance                813 non-null float64
scaled_variance.1              813 non-null float64
scaled_radius_of_gyration      813 non-null float64
scaled_radius_of_gyration.1    813 non-null float64
skewness_about                 813 non-null float64
skewness_about.1               813 non-null float64
skewness_about.2               813 non-null float64
hollows_ratio                  813 non-null int64
class                          813 non-null object
dtypes: float64(14), int64(4), object(1)
memory usage: 123.9+ KB
In [53]:
df.isna().sum()
Out[53]:
compactness                    0
circularity                    0
distance_circularity           0
radius_ratio                   0
pr.axis_aspect_ratio           0
max.length_aspect_ratio        0
scatter_ratio                  0
elongatedness                  0
pr.axis_rectangularity         0
max.length_rectangularity      0
scaled_variance                0
scaled_variance.1              0
scaled_radius_of_gyration      0
scaled_radius_of_gyration.1    0
skewness_about                 0
skewness_about.1               0
skewness_about.2               0
hollows_ratio                  0
class                          0
dtype: int64
In [54]:
# 5 point summary
df.describe(include='all')
Out[54]:
compactness circularity distance_circularity radius_ratio pr.axis_aspect_ratio max.length_aspect_ratio scatter_ratio elongatedness pr.axis_rectangularity max.length_rectangularity scaled_variance scaled_variance.1 scaled_radius_of_gyration scaled_radius_of_gyration.1 skewness_about skewness_about.1 skewness_about.2 hollows_ratio class
count 813.000000 813.000000 813.00000 813.000000 813.000000 813.000000 813.000000 813.00000 813.000000 813.000000 813.000000 813.000000 813.000000 813.000000 813.000000 813.000000 813.000000 813.000000 813
unique NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 3
top NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN car
freq NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 413
mean 93.656827 44.803198 82.04305 169.098401 61.774908 8.599016 168.563346 40.98893 20.558426 147.891759 188.377614 438.382534 174.252153 72.399754 6.351784 12.687577 188.979090 195.729397 NaN
std 8.233751 6.146659 15.78307 33.615402 7.973000 4.677174 33.082186 7.80338 2.573184 14.504648 31.165873 175.270368 32.332161 7.475994 4.921476 8.926951 6.153681 7.398781 NaN
min 73.000000 33.000000 40.00000 104.000000 47.000000 2.000000 112.000000 26.00000 17.000000 118.000000 130.000000 184.000000 109.000000 59.000000 0.000000 0.000000 176.000000 181.000000 NaN
25% 87.000000 40.000000 70.00000 141.000000 57.000000 7.000000 146.000000 33.00000 19.000000 137.000000 167.000000 318.000000 149.000000 67.000000 2.000000 6.000000 184.000000 191.000000 NaN
50% 93.000000 44.000000 79.00000 167.000000 61.000000 8.000000 157.000000 43.00000 20.000000 146.000000 179.000000 364.000000 173.000000 71.000000 6.000000 11.000000 189.000000 197.000000 NaN
75% 100.000000 49.000000 98.00000 195.000000 65.000000 10.000000 198.000000 46.00000 23.000000 159.000000 217.000000 586.000000 198.000000 75.000000 9.000000 19.000000 193.000000 201.000000 NaN
max 119.000000 59.000000 112.00000 333.000000 138.000000 55.000000 265.000000 61.00000 29.000000 188.000000 320.000000 1018.000000 268.000000 135.000000 22.000000 41.000000 206.000000 211.000000 NaN
In [55]:
df.head()
Out[55]:
compactness circularity distance_circularity radius_ratio pr.axis_aspect_ratio max.length_aspect_ratio scatter_ratio elongatedness pr.axis_rectangularity max.length_rectangularity scaled_variance scaled_variance.1 scaled_radius_of_gyration scaled_radius_of_gyration.1 skewness_about skewness_about.1 skewness_about.2 hollows_ratio class
0 95 48.0 83.0 178.0 72.0 10 162.0 42.0 20.0 159 176.0 379.0 184.0 70.0 6.0 16.0 187.0 197 van
1 91 41.0 84.0 141.0 57.0 9 149.0 45.0 19.0 143 170.0 330.0 158.0 72.0 9.0 14.0 189.0 199 van
2 104 50.0 106.0 209.0 66.0 10 207.0 32.0 23.0 158 223.0 635.0 220.0 73.0 14.0 9.0 188.0 196 car
3 93 41.0 82.0 159.0 63.0 9 144.0 46.0 19.0 143 160.0 309.0 127.0 63.0 6.0 10.0 199.0 207 van
4 85 44.0 70.0 205.0 103.0 52 149.0 45.0 19.0 144 241.0 325.0 188.0 127.0 9.0 11.0 180.0 183 bus

Data pre-processing - Understand the data and treat missing values and outliers (use box plots) (5 points)

In [56]:
#Since the variable is categorical, you can use value_counts function
pd.value_counts(df['class'])
Out[56]:
car    413
bus    205
van    195
Name: class, dtype: int64
In [57]:
import matplotlib.pyplot as plt
%matplotlib inline
pd.value_counts(df["class"]).plot(kind="bar")
Out[57]:
<matplotlib.axes._subplots.AxesSubplot at 0xec8cbb0>

Since the dataset does not distinguish between the two cars (the Saab 9000 and the Opel Manta 400), both are labelled 'car'. As a result, the number of entries in the class 'car' is roughly double the number of entries in 'bus' or 'van'.

In [58]:
sns.pairplot(df,hue='class')
Out[58]:
<seaborn.axisgrid.PairGrid at 0xfe917f0>

Here, using hue to distinguish between the classes, we see a clear separation in the peaks of some of the attributes for the three classes; these attributes can therefore be helpful in classifying the entries.

We can also see high correlation among many of the attributes, so reducing the dimensionality of the dataset, either by dropping highly correlated attributes or by using PCA, is extremely important here.

In [59]:
df.corr()
Out[59]:
compactness circularity distance_circularity radius_ratio pr.axis_aspect_ratio max.length_aspect_ratio scatter_ratio elongatedness pr.axis_rectangularity max.length_rectangularity scaled_variance scaled_variance.1 scaled_radius_of_gyration scaled_radius_of_gyration.1 skewness_about skewness_about.1 skewness_about.2 hollows_ratio
compactness 1.000000 0.689885 0.789955 0.688130 0.090557 0.150369 0.814026 -0.788051 0.814227 0.674902 0.764386 0.820240 0.581405 -0.258437 0.231648 0.168384 0.296195 0.372806
circularity 0.689885 1.000000 0.797704 0.623950 0.155023 0.251619 0.858149 -0.825108 0.856137 0.965366 0.806108 0.850932 0.935594 0.049070 0.141726 -0.001975 -0.113902 0.049331
distance_circularity 0.789955 0.797704 1.000000 0.771404 0.163386 0.265591 0.909023 -0.912713 0.897261 0.773459 0.865683 0.891789 0.705689 -0.238145 0.110280 0.277851 0.145258 0.343228
radius_ratio 0.688130 0.623950 0.771404 1.000000 0.667375 0.452460 0.743470 -0.795761 0.716210 0.570478 0.806788 0.731773 0.544636 -0.175348 0.044693 0.178079 0.375591 0.470895
pr.axis_aspect_ratio 0.090557 0.155023 0.163386 0.667375 1.000000 0.652093 0.113696 -0.191193 0.086992 0.133553 0.290375 0.100668 0.135663 0.173060 -0.059244 -0.040769 0.229702 0.257566
max.length_aspect_ratio 0.150369 0.251619 0.265591 0.452460 0.652093 1.000000 0.171445 -0.183242 0.167514 0.309180 0.331124 0.150069 0.197179 0.308329 0.016461 0.041210 -0.030543 0.139283
scatter_ratio 0.814026 0.858149 0.909023 0.743470 0.113696 0.171445 1.000000 -0.973413 0.991992 0.808154 0.950067 0.996396 0.795748 -0.045632 0.070118 0.227375 0.009967 0.138424
elongatedness -0.788051 -0.825108 -0.912713 -0.795761 -0.191193 -0.183242 -0.973413 1.000000 -0.950345 -0.771099 -0.937846 -0.956858 -0.761563 0.119750 -0.046621 -0.201325 -0.117568 -0.233408
pr.axis_rectangularity 0.814227 0.856137 0.897261 0.716210 0.086992 0.167514 0.991992 -0.950345 1.000000 0.811979 0.935653 0.992119 0.792895 -0.033769 0.078701 0.231171 -0.017383 0.117202
max.length_rectangularity 0.674902 0.965366 0.773459 0.570478 0.133553 0.309180 0.808154 -0.771099 0.811979 1.000000 0.744760 0.796230 0.865240 0.031098 0.130356 0.013045 -0.108019 0.086792
scaled_variance 0.764386 0.806108 0.865683 0.806788 0.290375 0.331124 0.950067 -0.937846 0.935653 0.744760 1.000000 0.947617 0.776051 0.099371 0.034228 0.208966 0.017828 0.104989
scaled_variance.1 0.820240 0.850932 0.891789 0.731773 0.100668 0.150069 0.996396 -0.956858 0.992119 0.796230 0.947617 1.000000 0.791997 -0.037903 0.072393 0.220054 0.011782 0.125118
scaled_radius_of_gyration 0.581405 0.935594 0.705689 0.544636 0.135663 0.197179 0.795748 -0.761563 0.792895 0.865240 0.776051 0.791997 1.000000 0.177284 0.162397 -0.041153 -0.224495 -0.102167
scaled_radius_of_gyration.1 -0.258437 0.049070 -0.238145 -0.175348 0.173060 0.308329 -0.045632 0.119750 -0.033769 0.031098 0.099371 -0.037903 0.177284 1.000000 -0.088109 -0.120600 -0.748668 -0.798810
skewness_about 0.231648 0.141726 0.110280 0.044693 -0.059244 0.016461 0.070118 -0.046621 0.078701 0.130356 0.034228 0.072393 0.162397 -0.088109 1.000000 -0.022611 0.111135 0.098128
skewness_about.1 0.168384 -0.001975 0.277851 0.178079 -0.040769 0.041210 0.227375 -0.201325 0.231171 0.013045 0.208966 0.220054 -0.041153 -0.120600 -0.022611 1.000000 0.077942 0.201286
skewness_about.2 0.296195 -0.113902 0.145258 0.375591 0.229702 -0.030543 0.009967 -0.117568 -0.017383 -0.108019 0.017828 0.011782 -0.224495 -0.748668 0.111135 0.077942 1.000000 0.894057
hollows_ratio 0.372806 0.049331 0.343228 0.470895 0.257566 0.139283 0.138424 -0.233408 0.117202 0.086792 0.104989 0.125118 -0.102167 -0.798810 0.098128 0.201286 0.894057 1.000000

From the correlation matrix it can be seen that the variables scatter_ratio, elongatedness, pr.axis_rectangularity, scaled_variance and scaled_variance.1 have correlations of over 0.93 with each other, and some pairs exceed 0.99. A single attribute could represent this set with negligible loss of information. Let us call this set of 5 attributes HACS 1 (High Attribute Correlation Set).

Apart from the columns in HACS 1, we also see high correlation between circularity and max.length_rectangularity, as well as between scaled_radius_of_gyration and circularity; the correlations among these three are above 0.86. Let us call them HACS 2.

We can also see that some of the attributes of HACS 1 and HACS 2 have considerable correlation with each other.

In this way we could eliminate many attributes through more intuitive EDA, or by using PCA.
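The screening described above can be automated. Below is a minimal sketch (on a small synthetic DataFrame standing in for the vehicle data; the helper name `high_corr_pairs` is our own) that lists every pair of attributes whose absolute Pearson correlation exceeds a threshold:

```python
import numpy as np
import pandas as pd

def high_corr_pairs(frame, threshold=0.9):
    """Return (col_i, col_j, |r|) for attribute pairs with |correlation| > threshold."""
    corr = frame.corr().abs()
    cols = corr.columns
    return [(cols[i], cols[j], corr.iloc[i, j])
            for i in range(len(cols))
            for j in range(i + 1, len(cols))
            if corr.iloc[i, j] > threshold]

# Synthetic illustration: 'b' is a noisy copy of 'a', 'c' is independent
rng = np.random.default_rng(0)
a = rng.normal(size=200)
demo = pd.DataFrame({'a': a,
                     'b': a + 0.05 * rng.normal(size=200),
                     'c': rng.normal(size=200)})
print(high_corr_pairs(demo))  # only the ('a', 'b') pair should appear
```

Applied to `df.drop('class', axis=1)`, the same helper would flag the HACS 1 and HACS 2 sets discussed above.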

In [60]:
sns.boxplot(x= 'class',y = 'compactness',data = df)
Out[60]:
<matplotlib.axes._subplots.AxesSubplot at 0xcfb7c50>
In [61]:
df.groupby(['class']).describe()['compactness']
Out[61]:
count mean std min 25% 50% 75% max
class
bus 205.0 91.478049 8.515600 78.0 85.0 89.0 98.0 114.0
car 413.0 96.244552 8.776112 73.0 89.0 97.0 104.0 119.0
van 195.0 90.466667 3.799439 82.0 88.0 90.0 93.0 100.0
In [62]:
sns.boxplot(x= 'class',y = 'circularity',data = df)
Out[62]:
<matplotlib.axes._subplots.AxesSubplot at 0x186f7df0>
In [63]:
sns.boxplot(x= 'class',y = 'distance_circularity',data = df)
Out[63]:
<matplotlib.axes._subplots.AxesSubplot at 0x1b3a3cd0>
In [64]:
sns.boxplot(x= 'class',y = 'radius_ratio',data = df)
Out[64]:
<matplotlib.axes._subplots.AxesSubplot at 0x1e3d8d70>
In [65]:
sns.boxplot(x= 'class',y = 'pr.axis_aspect_ratio',data = df)
Out[65]:
<matplotlib.axes._subplots.AxesSubplot at 0x1e40dab0>
In [66]:
sns.boxplot(x= 'class',y = 'max.length_aspect_ratio',data = df)
Out[66]:
<matplotlib.axes._subplots.AxesSubplot at 0x1e457630>
In [67]:
sns.boxplot(x= 'class',y = 'scatter_ratio',data = df)
Out[67]:
<matplotlib.axes._subplots.AxesSubplot at 0x1e4aa690>
In [68]:
sns.boxplot(x= 'class',y = 'max.length_rectangularity',data = df)
Out[68]:
<matplotlib.axes._subplots.AxesSubplot at 0x1e4fbf50>
In [69]:
sns.boxplot(x= 'class',y = 'scaled_radius_of_gyration',data = df)
Out[69]:
<matplotlib.axes._subplots.AxesSubplot at 0x1e555c70>
In [70]:
sns.boxplot(x= 'class',y = 'scaled_radius_of_gyration.1',data = df)
Out[70]:
<matplotlib.axes._subplots.AxesSubplot at 0x1e5b4b90>
In [71]:
sns.boxplot(x= 'class',y = 'skewness_about',data = df)
Out[71]:
<matplotlib.axes._subplots.AxesSubplot at 0x1e602390>
In [72]:
sns.boxplot(x= 'class',y = 'skewness_about.1',data = df)
Out[72]:
<matplotlib.axes._subplots.AxesSubplot at 0x1b44cc70>
In [73]:
sns.boxplot(x= 'class',y = 'skewness_about.2',data = df)
Out[73]:
<matplotlib.axes._subplots.AxesSubplot at 0x1e5fbdb0>
In [74]:
sns.boxplot(x= 'class',y = 'hollows_ratio',data = df)
Out[74]:
<matplotlib.axes._subplots.AxesSubplot at 0x1e6f1c90>

Q3. Standardize the data

Since the dimensions of the data are not really known to us, it would be wise to standardize the data using z-scores before we go on to any further modelling. You can use the zscore function to do this.

In [75]:
interest_df = df.drop('class',axis=1)
In [76]:
target_df = df.pop("class")
In [77]:
from scipy.stats import zscore
interest_df_z = interest_df.apply(zscore)
In [78]:
interest_df_z.head()
Out[78]:
compactness circularity distance_circularity radius_ratio pr.axis_aspect_ratio max.length_aspect_ratio scatter_ratio elongatedness pr.axis_rectangularity max.length_rectangularity scaled_variance scaled_variance.1 scaled_radius_of_gyration scaled_radius_of_gyration.1 skewness_about skewness_about.1 skewness_about.2 hollows_ratio
0 0.163231 0.520408 0.060669 0.264970 1.283254 0.299721 -0.198517 0.129648 -0.217151 0.766312 -0.397397 -0.339014 0.301676 -0.321192 -0.071523 0.371287 -0.321809 0.171837
1 -0.322874 -0.619123 0.124067 -0.836393 -0.599253 0.085785 -0.591720 0.514333 -0.606014 -0.337462 -0.590034 -0.618754 -0.502972 -0.053505 0.538425 0.147109 0.003400 0.442318
2 1.256966 0.845988 1.518823 1.187734 0.530251 0.299721 1.162569 -1.152637 0.949438 0.697326 1.111591 1.122486 1.415804 0.080339 1.555006 -0.413338 -0.159204 0.036596
3 -0.079822 -0.619123 -0.002729 -0.300595 0.153750 0.085785 -0.742952 0.642562 -0.606014 -0.337462 -0.911095 -0.738643 -1.462359 -1.258099 -0.071523 -0.301249 1.629444 1.524243
4 -1.052030 -0.130753 -0.763506 1.068668 5.173770 9.285029 -0.591720 0.514333 -0.606014 -0.268476 1.689501 -0.647299 0.425468 7.307905 0.538425 -0.189159 -1.460039 -1.721531
In [79]:
from sklearn.preprocessing import StandardScaler
import numpy as np
sc = StandardScaler()
X_std = sc.fit_transform(interest_df) 
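The two standardization routes used here should coincide: both scipy's zscore and scikit-learn's StandardScaler centre each column and divide by the population standard deviation (ddof=0). A minimal check on synthetic data (the small DataFrame `demo` is our own stand-in):

```python
import numpy as np
import pandas as pd
from scipy.stats import zscore
from sklearn.preprocessing import StandardScaler

# Both routes use the population std (ddof=0), so the results should match
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.normal(size=(100, 3)), columns=list('abc'))

via_zscore = demo.apply(zscore).to_numpy()
via_scaler = StandardScaler().fit_transform(demo)
print(np.allclose(via_zscore, via_scaler))  # → True
```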
In [80]:
y = target_df.replace({'car':1,'bus':2,'van':3})
In [81]:
y.head()
Out[81]:
0    3
1    3
2    1
3    3
4    2
Name: class, dtype: int64
In [82]:
X_std[:,:]
Out[82]:
array([[ 0.16323063,  0.52040788,  0.06066872, ...,  0.37128716,
        -0.3218087 ,  0.17183708],
       [-0.32287376, -0.61912319,  0.12406675, ...,  0.14710858,
         0.00340009,  0.44231829],
       [ 1.2569655 ,  0.84598818,  1.51882349, ..., -0.41333788,
        -0.1592043 ,  0.03659647],
       ...,
       [ 1.5000177 ,  1.49714879,  1.20183332, ..., -0.97378433,
        -0.3218087 ,  0.7127995 ],
       [-0.93050425, -1.43307395, -0.25632145, ...,  1.38009078,
         0.16600449, -0.09864413],
       [-1.05203035, -1.43307395, -1.01709784, ...,  0.59546574,
        -0.4844131 , -0.77484716]])
In [83]:
X_std.shape
Out[83]:
(813, 18)
In [84]:
# Covariance matrix of the standardized data
covMatrix = np.cov(X_std, rowvar=False)
print(covMatrix)
[[ 1.00123153  0.69073497  0.79092746  0.68897729  0.09066804  0.1505537
   0.81502868 -0.78902127  0.81522961  0.67573322  0.76532752  0.82125027
   0.58212123 -0.25875528  0.23193313  0.16859183  0.29656022  0.3732647 ]
 [ 0.69073497  1.00123153  0.79868656  0.62471862  0.15521415  0.25192897
   0.85920548 -0.8261242   0.85719089  0.96655501  0.80710097  0.85197956
   0.93674669  0.0491303   0.1419004  -0.00197723 -0.1140426   0.04939203]
 [ 0.79092746  0.79868656  1.00123153  0.77235395  0.16358689  0.26591783
   0.91014241 -0.9138366   0.89836576  0.77441118  0.86674929  0.89288735
   0.70655787 -0.23843852  0.11041593  0.278193    0.14543699  0.34365085]
 [ 0.68897729  0.62471862  0.77235395  1.00123153  0.66819724  0.45301698
   0.74438595 -0.79674104  0.71709175  0.57118076  0.80778118  0.73267385
   0.54530637 -0.17556405  0.04474816  0.17829807  0.37605357  0.47147529]
 [ 0.09066804  0.15521415  0.16358689  0.66819724  1.00123153  0.6528959
   0.11383635 -0.19142882  0.08709873  0.13371753  0.29073296  0.10079166
   0.1358303   0.17327362 -0.05931667 -0.04081886  0.22998448  0.25788318]
 [ 0.1505537   0.25192897  0.26591783  0.45301698  0.6528959   1.00123153
   0.17165622 -0.18346816  0.16772014  0.30956088  0.33153186  0.15025349
   0.19742152  0.30870835  0.01648166  0.04126053 -0.03058065  0.13945419]
 [ 0.81502868  0.85920548  0.91014241  0.74438595  0.11383635  0.17165622
   1.00123153 -0.97461169  0.99321402  0.80914895  0.95123751  0.99762263
   0.79672833 -0.04568825  0.07020422  0.22765518  0.00997918  0.13859428]
 [-0.78902127 -0.8261242  -0.9138366  -0.79674104 -0.19142882 -0.18346816
  -0.97461169  1.00123153 -0.95151538 -0.7720487  -0.93900061 -0.95803596
  -0.76250122  0.11989713 -0.04667832 -0.20157325 -0.11771267 -0.23369509]
 [ 0.81522961  0.85719089  0.89836576  0.71709175  0.08709873  0.16772014
   0.99321402 -0.95151538  1.00123153  0.81297896  0.93680528  0.99334122
   0.79387149 -0.0338105   0.07879827  0.23145534 -0.01740457  0.11734664]
 [ 0.67573322  0.96655501  0.77441118  0.57118076  0.13371753  0.30956088
   0.80914895 -0.7720487   0.81297896  1.00123153  0.74567768  0.79721018
   0.86630563  0.03113609  0.1305165   0.01306069 -0.1081518   0.08689924]
 [ 0.76532752  0.80710097  0.86674929  0.80778118  0.29073296  0.33153186
   0.95123751 -0.93900061  0.93680528  0.74567768  1.00123153  0.94878385
   0.77700696  0.09949344  0.03427042  0.20922369  0.01785045  0.10511871]
 [ 0.82125027  0.85197956  0.89288735  0.73267385  0.10079166  0.15025349
   0.99762263 -0.95803596  0.99334122  0.79721018  0.94878385  1.00123153
   0.79297189 -0.03794995  0.07248206  0.22032505  0.01179646  0.12527211]
 [ 0.58212123  0.93674669  0.70655787  0.54530637  0.1358303   0.19742152
   0.79672833 -0.76250122  0.79387149  0.86630563  0.77700696  0.79297189
   1.00123153  0.17750261  0.1625974  -0.04120413 -0.22477134 -0.10229307]
 [-0.25875528  0.0491303  -0.23843852 -0.17556405  0.17327362  0.30870835
  -0.04568825  0.11989713 -0.0338105   0.03113609  0.09949344 -0.03794995
   0.17750261  1.00123153 -0.08821744 -0.12074877 -0.74958968 -0.7997942 ]
 [ 0.23193313  0.1419004   0.11041593  0.04474816 -0.05931667  0.01648166
   0.07020422 -0.04667832  0.07879827  0.1305165   0.03427042  0.07248206
   0.1625974  -0.08821744  1.00123153 -0.02263933  0.11127169  0.0982493 ]
 [ 0.16859183 -0.00197723  0.278193    0.17829807 -0.04081886  0.04126053
   0.22765518 -0.20157325  0.23145534  0.01306069  0.20922369  0.22032505
  -0.04120413 -0.12074877 -0.02263933  1.00123153  0.07803801  0.20153412]
 [ 0.29656022 -0.1140426   0.14543699  0.37605357  0.22998448 -0.03058065
   0.00997918 -0.11771267 -0.01740457 -0.1081518   0.01785045  0.01179646
  -0.22477134 -0.74958968  0.11127169  0.07803801  1.00123153  0.89515759]
 [ 0.3732647   0.04939203  0.34365085  0.47147529  0.25788318  0.13945419
   0.13859428 -0.23369509  0.11734664  0.08689924  0.10511871  0.12527211
  -0.10229307 -0.7997942   0.0982493   0.20153412  0.89515759  1.00123153]]
In [85]:
from sklearn.decomposition import PCA
pca = PCA(n_components=18)
pca.fit(X_std)
Out[85]:
PCA(copy=True, iterated_power='auto', n_components=18, random_state=None,
    svd_solver='auto', tol=0.0, whiten=False)

Eigenvalues

In [86]:
print(pca.explained_variance_)
[9.45338700e+00 2.98961888e+00 1.91768721e+00 1.17011696e+00
 9.29094522e-01 5.32171101e-01 3.59073770e-01 2.22360825e-01
 1.56093142e-01 9.28191209e-02 6.35293025e-02 4.43920604e-02
 3.47623783e-02 2.12103219e-02 1.61152840e-02 1.31079986e-02
 6.26175180e-03 3.65863172e-04]

Eigenvectors

In [87]:
print(pca.components_)
[[ 2.74447428e-01  2.94003600e-01  3.04380218e-01  2.68888600e-01
   8.30199914e-02  9.84825471e-02  3.16688948e-01 -3.13205048e-01
   3.13612229e-01  2.81285672e-01  3.09161565e-01  3.14163506e-01
   2.70337322e-01 -2.56036923e-02  3.96799548e-02  6.31600075e-02
   3.09410342e-02  7.93661290e-02]
 [-1.27105989e-01  1.34430321e-01 -7.21448351e-02 -1.76416250e-01
  -9.87633531e-02  3.02068515e-02  4.43899402e-02  1.52539710e-02
   5.72306050e-02  1.20324381e-01  6.19096771e-02  4.80210991e-02
   2.10169704e-01  4.93793797e-01 -5.62601909e-02 -1.21035426e-01
  -5.44491703e-01 -5.38881650e-01]
 [-1.15778231e-01 -3.64513515e-02 -5.51881577e-02  2.81804540e-01
   6.45768877e-01  5.86412351e-01 -9.85848213e-02  5.66515157e-02
  -1.12039253e-01 -2.41324720e-02  5.97234736e-02 -1.09452782e-01
  -3.70630986e-02  2.75779539e-01 -1.10191782e-01 -8.04993535e-02
   3.17280047e-02  5.69462532e-02]
 [ 8.00766389e-02  1.90342131e-01 -6.93709791e-02 -4.46505645e-02
   3.00532206e-02  2.97502955e-02 -9.44172353e-02  8.50674367e-02
  -9.18974234e-02  1.92293894e-01 -1.19475682e-01 -9.13175862e-02
   2.04886762e-01 -7.15330519e-02  6.05082831e-01 -6.62058494e-01
   1.01853490e-01  5.15794859e-02]
 [ 7.01971756e-02 -8.66726774e-02  3.89590342e-02 -4.36633252e-02
  -3.84681508e-02  2.12001001e-01 -1.70159521e-02  7.58227573e-02
   7.39329767e-04 -6.42432032e-02  2.56413035e-03 -1.94434681e-02
  -6.31950148e-02  1.49126516e-01  7.29057797e-01  5.99761520e-01
  -9.41079552e-02 -2.88601711e-02]
 [ 1.41269187e-01 -2.78132128e-01 -1.36322721e-01  2.55012111e-01
   2.37902011e-01 -4.38916094e-01  1.16490485e-01 -1.47678917e-01
   9.15033641e-02 -4.62568841e-01  2.30563287e-01  1.52146030e-01
  -1.33617264e-01  2.32402344e-01  2.06512153e-01 -1.94570542e-01
   1.46274675e-01 -2.49058960e-01]
 [ 4.78882430e-01 -2.32112671e-01  6.01737282e-02 -1.69695187e-01
  -3.85740888e-01  4.85306492e-01  6.58373536e-02  1.50506530e-02
   9.77300277e-02 -1.07288254e-01  1.14298332e-01  8.22431682e-02
  -3.91191664e-01  1.17229443e-01 -7.73337339e-02 -2.85755608e-01
   1.40731375e-02  1.61871702e-04]
 [-5.51289488e-01 -1.77146565e-01  4.36495932e-01  9.74234227e-02
  -7.62877575e-02  1.73621590e-01  1.02868451e-01 -2.16260508e-01
   7.04346329e-02 -2.51590945e-01  4.88441109e-02  3.60905529e-02
  -1.24747692e-01 -3.40135356e-01  1.56634679e-01 -2.13789724e-01
  -3.11235076e-01 -3.24817676e-02]
 [-4.70291162e-01 -8.14934003e-03 -1.76997855e-01 -2.23935334e-01
  -2.97909278e-01  1.58406976e-01  6.59439525e-02 -1.67628175e-01
   1.48114965e-02 -9.46685154e-02  3.01488163e-01  7.40065725e-02
   2.41941897e-01  3.21737825e-01  2.20208315e-02 -6.04987464e-03
   5.03554117e-01  1.71175492e-01]
 [-2.71011706e-01  8.72464933e-02 -2.14799970e-01 -5.59954560e-02
   1.12777107e-01 -1.23445271e-01  1.69547227e-01 -1.45780526e-01
   2.02572760e-01  4.77073545e-01 -1.19097512e-01  1.52179649e-01
  -6.71018683e-01  1.32621729e-01  1.00708823e-01 -3.43286369e-02
   2.17245830e-02  7.80097572e-02]
 [ 3.88480982e-02 -1.10855608e-02  7.02546566e-01 -1.18842781e-01
   3.62581375e-02 -2.61997629e-01 -1.58594849e-01 -5.47593166e-02
  -2.71336756e-01  1.54729824e-01  9.32732969e-02 -2.46983541e-01
  -1.52481306e-01  4.18580991e-01 -1.21985888e-02 -3.52770922e-02
   1.27446973e-01  9.59270378e-02]
 [-3.37118452e-02  1.12606467e-01  2.85482662e-02  1.28208356e-01
  -7.77877534e-02  1.32539287e-01 -1.00862500e-01 -7.71897833e-02
  -2.11908632e-01  2.00106636e-01  1.86145164e-01 -1.26013043e-01
  -1.54687992e-01 -2.96651486e-01  2.86078067e-04  8.44396504e-02
   3.97342182e-01 -7.18862391e-01]
 [-1.68590056e-01  6.30087058e-02  2.16299609e-01  9.69337723e-02
  -1.61980790e-02 -4.89580300e-02 -2.84033960e-02  8.31773188e-01
   2.82090175e-01  5.45386339e-03  1.60924768e-01  2.86608069e-01
  -5.63783601e-02  1.01189623e-02 -8.06263453e-03 -3.79278980e-02
   1.43127328e-01 -5.65465154e-02]
 [-4.08522941e-02  1.60416148e-01 -2.13651022e-01  5.77453359e-01
  -3.80214700e-01 -7.29902818e-02 -1.01546016e-01  7.90205379e-02
  -2.95199956e-01  6.38562534e-02  3.95136488e-01 -1.50954509e-01
  -1.46437079e-01  6.94828591e-02  2.22549225e-02 -1.70270478e-02
  -2.67301086e-01  2.41670167e-01]
 [ 3.33699383e-02 -4.20446984e-01 -1.19627607e-01 -3.63740754e-01
   2.44664158e-01 -8.06600957e-02 -7.06019867e-02  6.38409127e-02
  -4.95392297e-02  3.13768524e-01  6.21222770e-01 -4.19518767e-02
   7.10362584e-02 -2.54533399e-01  2.86625550e-02 -8.72192274e-03
  -1.84705983e-01  8.14613423e-02]
 [ 4.49941671e-02  6.57048309e-01 -2.21685599e-02 -3.87244720e-01
   1.96258041e-01  2.25960880e-03  1.64936005e-02 -5.69023288e-04
  -2.00345897e-01 -4.02768408e-01  2.74699416e-01  1.10399001e-01
  -2.45679952e-01 -1.20174593e-01  1.14302731e-02  1.35201740e-02
  -9.08486701e-02  6.03519942e-02]
 [-8.72862091e-03 -1.69794331e-01  2.86460253e-02  5.58540324e-03
   6.76296225e-03  2.24680847e-02  3.60489964e-01  9.20983495e-02
  -6.96385259e-01  9.88667659e-02 -1.45949504e-01  5.58796482e-01
   5.39116874e-02  7.13365890e-03  5.49332669e-04 -1.11097291e-03
  -3.84346601e-03 -8.55457184e-04]
 [-4.27500161e-04 -1.73650656e-02  1.13772044e-02  3.08910335e-02
  -2.56226259e-02  9.17782032e-03 -7.97232421e-01 -2.18434847e-01
   2.10132827e-02  2.88936428e-02 -3.14878700e-02  5.57166843e-01
  -6.29343908e-03  1.14105168e-02  4.06288415e-03  6.90089495e-03
  -3.72910520e-02  1.59113471e-02]]
In [88]:
print(pca.explained_variance_ratio_)
[5.24542179e-01 1.65885645e-01 1.06407135e-01 6.49265390e-02
 5.15528736e-02 2.95286958e-02 1.99240058e-02 1.23381844e-02
 8.66117477e-03 5.15027513e-03 3.52506448e-03 2.46319209e-03
 1.92886778e-03 1.17690183e-03 8.94192335e-04 7.27326420e-04
 3.47447209e-04 2.03007309e-05]
In [89]:
np.cumsum(pca.explained_variance_ratio_)
Out[89]:
array([0.52454218, 0.69042782, 0.79683496, 0.8617615 , 0.91331437,
       0.94284307, 0.96276707, 0.97510526, 0.98376643, 0.98891671,
       0.99244177, 0.99490496, 0.99683383, 0.99801073, 0.99890493,
       0.99963225, 0.9999797 , 1.        ])

From the principal component analysis we can see that the first seven components already cover more than 95% of the variance in the data, so we can get sufficiently accurate results using just those seven components. A step plot depicting this is given below.

In [90]:
plt.step(list(range(1, 19)), np.cumsum(pca.explained_variance_ratio_), where='post')
plt.ylabel('Cumulative variance explained')
plt.xlabel('Number of components')
plt.show()
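Rather than reading the component count off the elbow plot, scikit-learn's PCA can pick it automatically: passing a float between 0 and 1 as n_components keeps the smallest number of components whose cumulative explained variance exceeds that fraction. A sketch on synthetic correlated data (standing in for X_std):

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in: 10 correlated, standardized features
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10)) @ rng.normal(size=(10, 10))
X = (X - X.mean(axis=0)) / X.std(axis=0)

# A float n_components asks PCA for >= 95% of the variance
pca = PCA(n_components=0.95)
pca.fit(X)
print(pca.n_components_, pca.explained_variance_ratio_.sum())
```

On the vehicle data this would select the same seven components identified above.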
In [91]:
pca95 = PCA(n_components=7)
pca95.fit(X_std)
Out[91]:
PCA(copy=True, iterated_power='auto', n_components=7, random_state=None,
    svd_solver='auto', tol=0.0, whiten=False)
In [92]:
print(pca95.explained_variance_)
[9.453387   2.98961888 1.91768721 1.17011696 0.92909452 0.5321711
 0.35907377]
In [93]:
print(pca95.components_)
[[ 2.74447428e-01  2.94003600e-01  3.04380218e-01  2.68888600e-01
   8.30199914e-02  9.84825471e-02  3.16688948e-01 -3.13205048e-01
   3.13612229e-01  2.81285672e-01  3.09161565e-01  3.14163506e-01
   2.70337322e-01 -2.56036923e-02  3.96799548e-02  6.31600075e-02
   3.09410342e-02  7.93661290e-02]
 [-1.27105989e-01  1.34430321e-01 -7.21448351e-02 -1.76416250e-01
  -9.87633531e-02  3.02068515e-02  4.43899402e-02  1.52539710e-02
   5.72306050e-02  1.20324381e-01  6.19096771e-02  4.80210991e-02
   2.10169704e-01  4.93793797e-01 -5.62601909e-02 -1.21035426e-01
  -5.44491703e-01 -5.38881650e-01]
 [-1.15778231e-01 -3.64513515e-02 -5.51881577e-02  2.81804540e-01
   6.45768877e-01  5.86412351e-01 -9.85848213e-02  5.66515157e-02
  -1.12039253e-01 -2.41324720e-02  5.97234736e-02 -1.09452782e-01
  -3.70630986e-02  2.75779539e-01 -1.10191782e-01 -8.04993535e-02
   3.17280047e-02  5.69462532e-02]
 [ 8.00766389e-02  1.90342131e-01 -6.93709791e-02 -4.46505645e-02
   3.00532206e-02  2.97502955e-02 -9.44172353e-02  8.50674367e-02
  -9.18974234e-02  1.92293894e-01 -1.19475682e-01 -9.13175862e-02
   2.04886762e-01 -7.15330519e-02  6.05082831e-01 -6.62058494e-01
   1.01853490e-01  5.15794859e-02]
 [ 7.01971756e-02 -8.66726774e-02  3.89590342e-02 -4.36633252e-02
  -3.84681508e-02  2.12001001e-01 -1.70159521e-02  7.58227573e-02
   7.39329767e-04 -6.42432032e-02  2.56413035e-03 -1.94434681e-02
  -6.31950148e-02  1.49126516e-01  7.29057797e-01  5.99761520e-01
  -9.41079552e-02 -2.88601711e-02]
 [ 1.41269187e-01 -2.78132128e-01 -1.36322721e-01  2.55012111e-01
   2.37902011e-01 -4.38916094e-01  1.16490485e-01 -1.47678917e-01
   9.15033641e-02 -4.62568841e-01  2.30563287e-01  1.52146030e-01
  -1.33617264e-01  2.32402344e-01  2.06512153e-01 -1.94570542e-01
   1.46274675e-01 -2.49058960e-01]
 [ 4.78882430e-01 -2.32112671e-01  6.01737282e-02 -1.69695187e-01
  -3.85740888e-01  4.85306492e-01  6.58373536e-02  1.50506530e-02
   9.77300277e-02 -1.07288254e-01  1.14298332e-01  8.22431682e-02
  -3.91191664e-01  1.17229443e-01 -7.73337339e-02 -2.85755608e-01
   1.40731375e-02  1.61871702e-04]]
In [94]:
np.cumsum(pca95.explained_variance_ratio_)
Out[94]:
array([0.52454218, 0.69042782, 0.79683496, 0.8617615 , 0.91331437,
       0.94284307, 0.96276707])

Use Support vector machines and use grid search (try C values - 0.01, 0.05, 0.5, 1 and kernel = linear, rbf) and find out the best hyper parameters and do cross validation to find the accuracy. (10 points)

In [95]:
# Splitting into train and test
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_std, y, test_size = .30, random_state=0)

To apply a Support Vector Classifier (SVC) after principal component analysis (PCA), we create a pipeline, using 7 components for the PCA and gamma = 0.025, C = 3 as a trial configuration.

In [107]:
target_names=['car','bus','van']

# Creating a pipeline
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn import svm

pipe_trial = Pipeline([('pca', PCA(n_components=7)), ('clf', svm.SVC(gamma=0.025, C=3))])
pipe_trial.fit(X_train, y_train)
print('Train Accuracy: %.3f' % pipe_trial.score(X_train, y_train))
Train Accuracy: 0.921

Now, to tune the hyperparameters with cross-validation, we create a pipeline pipe_svc and run grid search cross-validation with 10 folds over a grid of 7 and 8 PCA components, SVC C values of 0.01, 0.05, 0.5 and 1, and both rbf and linear kernel types.

In [109]:
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC 
from sklearn.metrics import classification_report

pipe_svc = Pipeline([('pca', PCA()), ('svc', SVC())])

param_grid = {'pca__n_components': [7, 8],
              'svc__C': [0.01, 0.05, 0.5, 1],
              'svc__kernel': ['rbf', 'linear']}
grid_svc = GridSearchCV(pipe_svc, param_grid=param_grid, cv=10)
grid_svc.fit(X_train, y_train)

print(" Best cross-validation accuracy: {:.2f}".format(grid_svc.best_score_))
print(" Best parameters: ", grid_svc.best_params_)

y_pipe_svc_predict = grid_svc.predict(X_test)
print(" Test set accuracy: {:.2f}".format(grid_svc.score(X_test, y_test)))

print(classification_report(y_test, y_pipe_svc_predict, target_names=target_names))
 Best cross-validation accuracy: 0.94
 Best parameters:  {'pca__n_components': 8, 'svc__C': 1, 'svc__kernel': 'rbf'}
 Test set accuracy: 0.93
              precision    recall  f1-score   support

         car       0.92      0.96      0.94       127
         bus       0.96      1.00      0.98        51
         van       0.93      0.83      0.88        66

    accuracy                           0.93       244
   macro avg       0.94      0.93      0.93       244
weighted avg       0.93      0.93      0.93       244

The best cross-validation accuracy obtained was 0.94, with 8 PCA components, C = 1, and the radial basis function (RBF) kernel.

From the classification report, it can be observed that the bus is predicted most accurately: its size clearly distinguishes it from the car and the van, which are more similar to each other. Its recall of 1.00 shows that the model classifies every bus correctly.

It is also seen that, between the car and the van, the van is often misclassified as a car due to their similarity in size. The class imbalance, with car as the majority class, further hurts the van's recall, which at 0.83 is the lowest, while all other classes have a recall above 0.95.
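A confusion matrix makes these misclassifications explicit. A minimal sketch on toy labels (the real call would use y_test and y_pipe_svc_predict; the values below are hypothetical stand-ins):

```python
from sklearn.metrics import confusion_matrix

# Hypothetical stand-ins for y_test / y_pipe_svc_predict
y_true = ['car', 'car', 'bus', 'van', 'van', 'van']
y_pred = ['car', 'car', 'bus', 'van', 'van', 'car']

# Rows are true classes, columns are predictions;
# off-diagonal cells count confusions (here: one van predicted as car)
cm = confusion_matrix(y_true, y_pred, labels=['car', 'bus', 'van'])
print(cm)
# [[2 0 0]
#  [0 1 0]
#  [1 0 2]]
```

On the real predictions, the van row would show how many vans land in the car column, which is exactly the confusion described above.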

In [110]:
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import GaussianNB

# Instantiate the pipeline for PCA and Gaussian Naive Bayes
pipe_naive = Pipeline([('pca', PCA()), ('naive', GaussianNB())])

# Parameter grid for the number of PCA components
param_grid = {
    'pca__n_components': [7, 8]
}



# Instantiate the grid search model
grid_naive = GridSearchCV(estimator=pipe_naive, param_grid=param_grid, cv=10)
grid_naive.fit(X_train, y_train)

print(" Best cross-validation accuracy: {:.2f}".format(grid_naive.best_score_))
print(" Best parameters: ", grid_naive.best_params_)

y_pipe_naive_predict = grid_naive.predict(X_test)
print(" Test set accuracy: {:.2f}".format(grid_naive.score(X_test, y_test)))

print(classification_report(y_test, y_pipe_naive_predict, target_names=target_names))
 Best cross-validation accuracy: 0.80
 Best parameters:  {'pca__n_components': 8}
 Test set accuracy: 0.77
              precision    recall  f1-score   support

         car       0.76      0.92      0.84       127
         bus       0.78      0.71      0.74        51
         van       0.76      0.52      0.61        66

    accuracy                           0.77       244
   macro avg       0.77      0.71      0.73       244
weighted avg       0.77      0.77      0.76       244

The best cross-validation accuracy obtained was 0.80 with 8 PCA components, and the test accuracy was 0.77, considerably lower than the Support Vector Classifier, which achieved accuracies above 0.9 in both cross-validation and on the test set.

From the classification report, it can be observed that the car is predicted most accurately: the imbalance in the number of data points per class favours the majority class under the Naive Bayes algorithm, despite the clear size distinction of the bus from the car and the van. Upsampling, downsampling, or SMOTE from the imbalanced-learn library would probably lead to better results, since the imbalanced target classes have hurt the recall of the minority classes here as well.

As with the Support Vector Classifier, it is also seen that, between the car and the van, the van is often misclassified as a car due to their similarity in size.
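As a sketch of the simplest remedy mentioned above, random oversampling of a minority class can be done with sklearn.utils.resample (SMOTE from imbalanced-learn would instead synthesise interpolated samples rather than duplicating rows); the array below is a hypothetical stand-in for the van feature rows:

```python
import numpy as np
from sklearn.utils import resample

# Hypothetical minority-class feature rows (stand-in for the van samples)
X_van = np.array([[1.0, 2.0], [1.5, 1.8], [0.9, 2.2]])

# Duplicate rows at random until the class matches the majority-class size
X_van_up = resample(X_van, replace=True, n_samples=9, random_state=0)
print(X_van_up.shape)  # (9, 2)
```

The upsampled rows would be concatenated back with the other classes before refitting, so the classifier no longer sees car as a dominant majority.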

In [100]:
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

# Instantiate the pipeline for PCA and Random Forest Classifier
pipe_rf = Pipeline([('pca', PCA()), ('rf', RandomForestClassifier())])

# Parameter grid for the grid search
param_grid = {
    'pca__n_components': [7],
    'rf__bootstrap': [True],
    'rf__max_depth': [3, 4, 5, 6],
    'rf__max_features': [4, 5],
    'rf__min_samples_leaf': [3, 4, 5],
    'rf__min_samples_split': [8, 10, 12],
    'rf__n_estimators': [10, 20, 30, 100]
}



# Instantiate the grid search model
grid_rf = GridSearchCV(estimator=pipe_rf, param_grid=param_grid, cv=10)
grid_rf.fit(X_train, y_train)

print(" Best cross-validation accuracy: {:.2f}".format(grid_rf.best_score_))
print(" Best parameters: ", grid_rf.best_params_)
print(" Test set accuracy: {:.2f}".format(grid_rf.score(X_test, y_test)))
 Best cross-validation accuracy: 0.86
 Best parameters:  {'pca__n_components': 7, 'rf__bootstrap': True, 'rf__max_depth': 6, 'rf__max_features': 4, 'rf__min_samples_leaf': 4, 'rf__min_samples_split': 10, 'rf__n_estimators': 100}
 Test set accuracy: 0.85

The random forest grid search took a long time to run and did not reach accuracies close to those of the Support Vector Classifier, with a best cross-validation accuracy of 0.86 at the parameters shown and a test accuracy of 0.85. So Random Forest is not as effective an algorithm for this dataset as the SVC.